Unsupervised Learning of Patterns in Data Streams Using Compression and Edit Distance

نویسندگان

  • Sook-Ling Chua
  • Stephen R. Marsland
  • Hans W. Guesgen
چکیده

Many unsupervised learning methods for recognising patterns in data streams are based on fixed length data sequences, which makes them unsuitable for applications where the data sequences are of variable length such as in speech recognition, behaviour recognition and text classification. In order to use these methods on variable length data sequences, a pre-processing step is required to manually segment the data and select the appropriate features, which is often not practical in real-world applications. In this paper we suggest an unsupervised learning method that handles variable length data sequences by identifying structure in the data stream using text compression and the edit distance between ‘words’. We demonstrate that using this method we can automatically cluster unlabelled data in a data stream and perform segmentation. We evaluate the effectiveness of our proposed method using both fixed length and variable length benchmark datasets, comparing it to the Self-Organising Map in the first case. The results show a promising improvement over baseline recognition systems.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Unsupervised Learning of Human Behaviours

Behaviour recognition is the process of inferring the behaviour of an individual from a series of observations acquired from sensors such as in a smart home. The majority of existing behaviour recognition systems are based on supervised learning algorithms, which means that training them requires a preprocessed, annotated dataset. Unfortunately, annotating a dataset is a rather tedious process ...

متن کامل

Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents, produce imperfect records in real world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...

متن کامل

Web mining with relational clustering

Clustering is an unsupervised learning method that determines partitions and (possibly) prototypes from pattern sets. Sets of numerical patterns can be clustered by alternating optimization (AO) of clustering objective functions or by alternating cluster estimation (ACE). Sets of non–numerical patterns can often be represented numerically by (pairwise) relations. These relational data sets can ...

متن کامل

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Extracting patterns from data streams incrementally using bounded memory and bounded time is a difficult task. Traditional metrics for similarity search such as Euclidean distance solve the problem of difference in amplitudes between static time series prior to comparison by normalizing them. However, such a technique cannot be applied to data streams since the entire data is not available at a...

متن کامل

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011